Recent development of high-resolution mass spectrometry (MS) instruments enables
chemical cross-linking (XL) to become a high-throughput method for obtaining
structural information about proteins. Restraints derived from XL-MS experiments 
have been used successfully for structure refinement and protein-protein docking. 
However, one formidable question is under which circumstances XL-MS data might
be sufficient to determine a protein's tertiary structure de novo. Answering this
question will not only include understanding the impact of XL-MS data on sampling 
and scoring within a de novo protein structure prediction algorithm, it must also 
determine an optimal cross-linker type and length for protein structure determination. 
Whereas a longer cross-linker will yield more restraints, the value of each restraint 
for protein structure prediction decreases as the restraint is consistent with a larger 
conformational space. 

In this study, the number of cross-links and their discriminative power were systematically
analyzed in silico on a set of 2055 non-redundant protein folds considering
Lys-Lys, Lys-Asp, Lys-Glu, Cys-Cys, and Arg-Arg reactive cross-linkers between 1 Å
and 60 Å. A heuristic was developed that determines the optimal cross-linker length
depending on the protein size. Next, simulated restraints of variable length were
used to de novo predict the tertiary structure of fifteen proteins using the BCL::Fold
algorithm. The results demonstrate that a distinct cross-linker length exists for
which information content for de novo protein structure prediction is maximized.
The sampling accuracy improves on average by 1.0 Å and up to 2.2 Å in the most
prominent example. XL-MS restraints consistently enable an improved selection of
native-like models with an average enrichment of 2.1.
1 Introduction


“Structural Genomics” — the determination of the structure of all human proteins — would have 
profound impact on biochemical and biomedical research with direct implication to functional 
annotation, interpretation of mutations, development of small molecule binders, enzyme design, 
or prediction of protein-protein interaction. 1 Although significant progress towards this goal has 
been made through X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy, 
tertiary structure determination continues to be a challenge for many important human proteins. 
At present, high-resolution structures exist for about 5% of all human proteins in the Protein 
Data Bank (PDB). 2 For many uncharacterized human proteins, construction of a comparative 
model is possible starting from the experimental structure of a related protein. Nevertheless, for 
about 60% (~7800) of known protein families in the Pfam database 3 not a single structure
is deposited. 4 Many of these proteins will continue to evade high-resolution protein structure 
determination. 

Accordingly, researchers strive to develop alternative approaches. The most extreme approach 
includes computational methods that predict the tertiary structure of proteins from their 
sequence alone. Although computational methods are sometimes successful at predicting
the tertiary structure of small proteins with up to one hundred residues, 5 for larger proteins
the size of the conformational space to be searched as well as the discrimination of incorrectly
folded models hinder structure prediction. 6-8

However, recent studies demonstrate that combining de novo protein structure prediction
with limited experimental data, i.e. experimental data that alone is insufficient to unambiguously
determine the fold of the protein, can yield accurate models for larger proteins. The
structural restraints in those studies were acquired using electron paramagnetic resonance (EPR)
spectroscopy, electron microscopy (EM), 13 14 or NMR spectroscopy. 15

As an alternative technique, XL in combination with MS can be applied to obtain distance 
restraints, which can be used to guide protein structure prediction. 16-19 Using bifunctional
reagents with a defined length, functional groups within the protein can be covalently bridged 
in a native-like environment. Thus, it is possible to determine an upper limit for the distance 
between those residues after enzymatic proteolysis and identification of cross-linked peptides. 

This method allows for a fast analysis of protein structures in a native-like environment at a low
concentration and can even be applied to high molecular weight proteins, 20 membrane proteins, 21
or highly flexible proteins. 22 If combined with affinity purification, it becomes possible to study
proteins inside the cell. 23 Currently, the XL-MS technology is rapidly gaining importance, driven
by the development of liquid chromatography (LC)-MS instruments, the generation of advanced
analysis software, 24 and the direct integration into protein structure prediction workflows. 25-27
Furthermore, hundreds of different cross-linking reagents with different spacer lengths, reactivities,
and features for specific enrichment and improved detectability are now commercially available. 28

However, whereas the potential to combine XL-MS and computational modeling has been 
frequently demonstrated and many technical problems of XL-MS have been solved, several 
central questions have not yet been evaluated systematically. 

(i) Cross-linking reagents are available with a spacer length ranging from 0 Å to more than
35 Å. Whereas longer reagents are likely to provide more distance restraints, shorter cross-links
have higher information content in de novo structure prediction as the conformational
search space is more restricted. Thus, the question arises, which cross-linker spacer length
supports structure prediction best?

(ii) Cross-linking results are often used to confirm already existing structures. However, what is 
the average gain in model accuracy and selection of correct models when using cross-linking
data in conjunction with de novo protein structure prediction? 

(iii) Cross-linking reagents vary in reactivity towards different functional groups present in 
different amino acids. For de novo protein structure prediction, what is the gain of
additionally using cross-linkers with different reactivities?

In this study, we simulated cross-linking experiments on more than 2000 non-redundant 
protein structures to determine the number of possible and structurally relevant cross-links 
depending on the size of the protein as well as on the length and reactivity of the applied 
cross-linking reagents. We then tested the impact of cross-linking restraints on de novo protein 
structure prediction for fifteen selected proteins. 

2 Materials and methods 

2.1 Software and databases 

A subset of the PDB containing 2055 non-redundant protein structures was downloaded from
the PISCES server (Version 08.2012). 29 This PDB subset was created by filtering all available
structures for a resolution of 1.6 Å or better, a maximum sequence identity of 20%, and an
R-factor cutoff of 0.25. Euclidean distances and shortest solvent accessible surface (SAS) path
lengths between Cβ-Cβ, Lys-Nζ-Lys-Nζ, Lys-Nζ-Asp-Cγ, and Lys-Nζ-Glu-Cδ, as well as
Arg-Nη2-Arg-Nη2 and Cys-Sγ-Cys-Sγ atom pairs with a maximum intramolecular distance of 60 Å
were determined through the command line version of Xwalk. 30

2.2 Generation of sequence dependent distance functions 

Tables containing the Euclidean distances and the sequence separation between cross-linking
target amino acids (i) Lys-Lys, (ii) Lys-Asp, (iii) Lys-Glu, (iv) Arg-Arg, and (v) Cys-Cys
were generated. Amino acid pair distances were sorted into 2.5 Å bins. The total number of
observed pairs for each combination of sequence separation and Euclidean distance was counted.
Based on the result, an approximation of the distance distribution for every sequence separation
was created, and the median of each distribution was determined. A logarithmic function of the
form $E_{med} = a \times \ln(S) + b$ was calculated as a regression curve to correlate the sequence
separation $S$ with the median Euclidean distance $E_{med}$.
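The regression step can be illustrated with a minimal sketch. It assumes NumPy is available, skips the 2.5 Å binning and takes the median of the raw distances per sequence separation instead, and uses synthetic values that are not taken from the study; the function names are hypothetical.

```python
import numpy as np

def median_distance_per_separation(pairs):
    """Group (sequence_separation, euclidean_distance) pairs and take medians."""
    grouped = {}
    for separation, distance in pairs:
        grouped.setdefault(separation, []).append(distance)
    separations = sorted(grouped)
    medians = [float(np.median(grouped[s])) for s in separations]
    return separations, medians

def fit_log_regression(separations, median_distances):
    """Least-squares fit of E_med = a * ln(S) + b; returns (a, b)."""
    s = np.asarray(separations, dtype=float)
    e = np.asarray(median_distances, dtype=float)
    # The model is linear in ln(S), so a degree-1 polynomial fit is sufficient.
    a, b = np.polyfit(np.log(s), e, deg=1)
    return float(a), float(b)

# Synthetic illustration only: distances scattered around 5.0 * ln(S) + 2.0.
pairs = [(s, 5.0 * np.log(s) + 2.0 + offset)
         for s in range(2, 61) for offset in (-1.0, 0.0, 1.0)]
separations, medians = median_distance_per_separation(pairs)
a, b = fit_log_regression(separations, medians)
```

Because the model is linear in ln(S), no iterative optimizer is needed; the same call could be applied to binned medians.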

2.3 Calculation of the amino acid side-chain length 

Based on the structure of calmodulin (PDB entry 2KSZ), the average Cβ-Nζ, Cβ-Cγ, Cβ-Cδ,
Cβ-Nη2, and Cβ-Sγ distances of the side-chains of lysine, aspartic acid, glutamic acid, arginine,
and cysteine were determined to be 4.5 Å, 2.3 Å, 3.6 Å, 5.1 Å, and 1.8 Å, respectively.

2.4 Distinguishing impossible, possible and structurally valuable cross-links 

Cross-linker spacer lengths between 1 Å and 60 Å were evaluated, and cross-links were classified as either
(i) impossible cross-links, meaning that the distance between the Cβ atoms of the cross-linked
amino acids exceeds the sum of the spacer length and the side-chain lengths, or (ii) possible
cross-links, meaning that the Cβ-Cβ distance is below the sum of the spacer length and the
side-chain lengths. The latter group was subdivided into cross-links potentially useful for structure
determination (valuable cross-links) and those that are unlikely to contribute much information
(non-valuable cross-links). We defined cross-links as valuable if the spacer length is shorter than
the median distance expected for the given sequence separation by the equations derived in
Section 2.2 (see also Figure 1). For these calculations, all proteins were
grouped into 2.5 kDa bins. The calculations were performed for cross-linker lengths from 1 Å to
60 Å with a step size of 1 Å.

Figure 1: Residue pair distance distributions. (A) Distribution of the number of Lys-Lys pairs with
respect to their Euclidean distance [Å] and (B-D) functions representing the relationship between sequence
separation (in amino acids) and spatial distance, approximated by the method of least squares to a
logarithmic equation for (B) Lys-Lys, (C) Lys-Glu, and (D) Lys-Asp.
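The three-way classification can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the function name and signature are hypothetical, the coefficients `a` and `b` stand for a logarithmic fit of the kind described in Section 2.2, and the "valuable" test follows the wording of this section (spacer length shorter than the expected median distance).

```python
import math

def classify_cross_link(spacer, side_chain_1, side_chain_2,
                        cb_cb_distance, seq_separation, a, b):
    """Classify one residue pair as 'impossible', 'valuable' or 'non-valuable'.

    All lengths are in Angstrom. a and b are the coefficients of a
    logarithmic fit E_med = a * ln(S) + b (hypothetical inputs here).
    """
    reach = spacer + side_chain_1 + side_chain_2
    if cb_cb_distance > reach:
        return "impossible"
    # Median distance expected for this sequence separation.
    expected_median = a * math.log(seq_separation) + b
    # Criterion as worded in this section: spacer shorter than the median.
    if spacer < expected_median:
        return "valuable"
    return "non-valuable"

# Example with made-up coefficients (a = 5.0, b = 2.0) and Lys side-chains of 4.5 A.
label = classify_cross_link(13.0, 4.5, 4.5, cb_cb_distance=15.0,
                            seq_separation=10, a=5.0, b=2.0)
```

Counting the labels per spacer length and MW bin would give the numbers of possible and valuable cross-links used in the next section.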


2.5 Estimation of optimal spacer lengths for a given protein molecular weight 

Over all proteins in each molecular weight (MW) bin, the total number of possible distance
pairs ($\#possible$) as well as the number of distance pairs useful for structure determination
($\#valuable$) were computed for each cross-linker spacer length. Furthermore, the maximum
number of valuable cross-links observed over all spacer lengths ($\#valuable_{max}$) was determined.
For each MW bin, the ratios $\#valuable / \#possible$ and $\#valuable / \#valuable_{max}$ were plotted as a function of the
cross-linker spacer length. The optimal cross-linker length for each MW bin was approximated
as the intersection point of the two functions using a local regression (Figure 2). The
estimated values for the optimal cross-linker spacer length were plotted as a function of the MW
and were fitted using a cubic regression curve. The script used for the calculation is available at
http://www.ufz.de/index.php?en=19910.
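A minimal sketch of locating the intersection of the two ratio curves is given below. It is not the published script: it assumes NumPy, replaces the local regression with plain linear interpolation between neighbouring spacer lengths, and uses made-up counts; the function name is hypothetical.

```python
import numpy as np

def intersection_spacer_length(spacer_lengths, n_possible, n_valuable):
    """Spacer length where #valuable/#possible crosses #valuable/#valuable_max."""
    x = np.asarray(spacer_lengths, dtype=float)
    possible = np.asarray(n_possible, dtype=float)
    valuable = np.asarray(n_valuable, dtype=float)
    keep = valuable > 0  # both ratios are only informative where valuable links exist
    if not keep.any():
        return None
    x = x[keep]
    share = valuable[keep] / possible[keep]
    relative = valuable[keep] / valuable.max()
    diff = share - relative
    for i in range(len(x) - 1):
        if diff[i] == 0.0:
            return float(x[i])
        if diff[i] * diff[i + 1] < 0.0:
            # Linear interpolation between the two bracketing spacer lengths.
            t = diff[i] / (diff[i] - diff[i + 1])
            return float(x[i] + t * (x[i + 1] - x[i]))
    return float(x[-1]) if diff[-1] == 0.0 else None

# Made-up counts for five spacer lengths (not data from the study).
optimum = intersection_spacer_length([1, 2, 3, 4, 5],
                                     [10, 25, 40, 80, 100],
                                     [9, 15, 20, 16, 0])
```

With real per-bin counts on a 1 Å grid, a smoothing step before the sign-change search would play the role of the local regression mentioned in the text.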


2.6 Simulation of cross-linking restraints 

Seventeen proteins with known tertiary structure determined via X-ray crystallography (resolution
of <1.9 Å) were selected from the data set of structures as test cases to evaluate the influence
of cross-linking restraints on de novo protein structure prediction. To thoroughly benchmark the
algorithm, the benchmark set covers a wide range of protein topologies and structural features.
The sequence lengths of the proteins range from 105 to 303 residues, the number of secondary
structure elements (SSEs) ranges from 5 to 19 with varying α-helical and β-strand content
(Table 1). For these proteins, all solvent accessible surface Cβ-Cβ distances between target
amino acids in the structure which were within the range of either homobifunctional Lys-reactive
cross-linkers or heterobifunctional Lys-Asp/Glu reactive cross-linkers were determined through
Xwalk. For the predicted optimal cross-linker length (see above) and spacer lengths of 2.5 Å,
7.5 Å, 17.5 Å and 30.0 Å, lists of structurally possible cross-links were generated.

| Protein | UniProt | Resolution [Å] | Weight [Da] | Length [residues] | Lysine [%] | α-helix [%] | β-strand [%] |
|---------|---------|----------------|-------------|-------------------|------------|-------------|--------------|
| 1HRC | P00004 | 1.9 | 12368 | 105 | 18 | 40 | 1 |
| 3IV4 | Q7A6S3 | 1.5 | 13235 | 112 | 6 | 49 | 25 |
| 1BGF | P42228 | 1.5 | 14504 | 124 | 5 | 79 | 1 |
| 1T3Y | Q14019 | 1.2 | 15835 | 141 | 9 | 35 | 29 |
| 3M1X | C4LXT9 | 1.2 | 15882 | 138 | 7 | 25 | 28 |
| 1X91 | Q9LNF2 | 1.5 | 16419 | 153 | 7 | 76 | 0 |
| 1JL1 | P0A7Y4 | 1.3 | 17483 | 155 | 7 | 34 | 30 |
| 1MBO | P02185 | 1.6 | 17980 | 153 | 12 | 77 | 0 |
| 2QNL | Q11XA0 | 1.5 | 19218 | 162 | 5 | 70 | 2 |
| 2AP3 | Q8NX77 | 1.6 | 23190 | 199 | 23 | 81 | 0 |
| 1J77 | Q9RGD9 | 1.5 | 24226 | 209 | 8 | 62 | 1 |
| 1ES9 | Q29460 | 1.3 | 25876 | 232 | 3 | 41 | 11 |
| 3B50 | D0VWS1 | 1.4 | 27506 | 244 | 3 | 71 | 0 |
| 1QX0 | P0A2Y6 | 2.3 | 32821 | 293 | 7 | 38 | 20 |
| 2IXM | Q15257 | 1.5 | 34798 | 303 | 7 | 60 | 3 |
| FGF2 | P09038 | 1.5 | 17859 | 145 | 10 | 9 | 34 |
| p11 | P60903 | 2.0 | 11071 | 95 | 13 | 63 | 3 |

Table 1: Proteins used for the cross-link spacer length benchmark. The fifteen proteins for the
benchmark set were selected from high-resolution structures deposited in the PDB with varying content of
lysines. The structures were selected to cover a wide range of the structural features sequence length as
well as percentage of residues within α-helices and β-strands.

For the two proteins horse heart cytochrome c (PDB entry 1HRC) and oxymyoglobin (PDB
entry 1MBO), restraints were also derived from published cross-linking MS experiments deposited
in the XL database. 25 Experimental cross-linking data of FGF2 (PDB entry 1FGA) and p11
(PDB entry 4HRE) were derived from Young et al. 19 and Schulz et al., 31 respectively.


2.7 Translating cross-linking data into structural restraints 


Explicitly rebuilding coordinates for a cross-link is comparable to solving the loop closure
problem. 32 During de novo protein structure prediction, the cross-link would have to be reconstructed
each time the conformation of the protein changes. In a typical Monte Carlo (MC)
simulation with a maximum of 12 000 MC steps per model and 5000 models for each protein, this
would result in a maximum of 60 million attempts to build the cross-link, which is too
resource demanding for use in de novo protein structure prediction. Therefore, we developed
a fast approach to estimate the chance that a particular model fulfills an XL-MS restraint. The
surface path of a cross-link is approximated by laying a sphere around the protein structure
and computing the arc length between the cross-linked residues (Figure 6). The
geometrical center of the protein structure is used as the center of the sphere. If the takeoff and
landing points have different distances to the center of the sphere, the longer distance is used as
the radius. During the protein structure prediction process, the side-chains of the residues are
not modeled explicitly but represented in a simplified way by a 'super atom'. While this
simplification vastly reduces the computational demand of the algorithm, it also adds
uncertainty due to the unknown side-chain conformations. The agreement of the model with
the cross-linking data is quantified by comparing the cross-linker length
($l_{XL} + l_{SC1} + l_{SC2}$) with the computed arc length ($d_{arc}$), with −1 being the best agreement
and 0 being the worst agreement. To account for the uncertainty of side-chain conformations, a
cosine-transition region of 7 Å was introduced (Figure 6).
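The arc-length approximation and the scoring idea can be sketched as below. This is an illustration, not the BCL implementation: the function names are hypothetical, and the placement of the 7 Å cosine transition directly above the cross-linker reach is an assumption, since the text does not specify where the transition region starts.

```python
import math

def arc_length(center, point_a, point_b):
    """Arc between two points on a sphere around `center`.

    The larger of the two centre distances is used as the radius.
    """
    va = [p - c for p, c in zip(point_a, center)]
    vb = [p - c for p, c in zip(point_b, center)]
    ra = math.sqrt(sum(v * v for v in va))
    rb = math.sqrt(sum(v * v for v in vb))
    if ra == 0.0 or rb == 0.0:
        return 0.0  # degenerate case: a point sits on the centre
    cos_angle = sum(x * y for x, y in zip(va, vb)) / (ra * rb)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return max(ra, rb) * math.acos(cos_angle)

def restraint_score(d_arc, l_xl, l_sc1, l_sc2, transition=7.0):
    """Score in [-1, 0]: -1 is full agreement, 0 is no agreement."""
    reach = l_xl + l_sc1 + l_sc2
    if d_arc <= reach:
        return -1.0
    if d_arc >= reach + transition:
        return 0.0
    # Assumed cosine-shaped rise from -1 to 0 across the transition region.
    x = (d_arc - reach) / transition
    return -0.5 * (math.cos(math.pi * x) + 1.0)

# Two residues at right angles, 10 A from the centre: a quarter circle.
quarter = arc_length((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0))
```

Both functions cost a handful of arithmetic operations per restraint, which is the point of the approximation compared with rebuilding the cross-link explicitly at every MC step.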


2.8 Structure prediction protocol for the benchmark set 

The protein structure prediction protocol is based on the BCL::Fold protocol for soluble
proteins. 33 In a preparatory step, the SSEs are predicted using the SSE prediction methods
PsiPred 34 and Jufo9D 35 and an SSE pool is created. Subsequently, a Monte Carlo Metropolis
(MCM) energy minimization algorithm draws random SSEs from the predicted SSE pool
and places them in three-dimensional space (Figure 3). Random transformations
like translation, rotation, or shuffling of SSEs are applied. After each MC step, the
energy of the resulting model is evaluated using knowledge-based potentials which, among others,
evaluate the packing of SSEs, exposure of residues, radius of gyration, pairwise amino acid
interactions, loop closure geometry, and amino acid clashes. 36 Based on the energy difference to
the previous step and the simulated temperature, a Metropolis criterion decides whether to accept
or reject the most recent change.
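For readers unfamiliar with the acceptance step, the standard Metropolis criterion can be sketched as follows. This is the textbook form, with the temperature expressed in score units; it is not taken from the BCL source code, and the function name and the `rng` parameter are illustrative.

```python
import math
import random

def metropolis_accept(previous_energy, new_energy, temperature, rng=random):
    """Standard Metropolis criterion for a minimization run."""
    delta = new_energy - previous_energy
    if delta <= 0.0:
        return True   # improvements are always accepted
    if temperature <= 0.0:
        return False  # at zero temperature no worsening step is accepted
    # Worsening steps are accepted with probability exp(-delta / T).
    return rng.random() < math.exp(-delta / temperature)

class FixedRng:
    """Deterministic stand-in for the random module, for illustration."""
    def __init__(self, value):
        self.value = value
    def random(self):
        return self.value
```

Passing the random source as a parameter keeps the criterion reproducible, for example by handing in a seeded `random.Random` instance.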

The protein structure prediction protocol is broken into multiple stages, which differ regarding
the granularity of the transformations applied and the emphasis of different scoring terms. The
first five stages apply large structural
perturbations, which can alter the topology of the protein. Each of the five stages lasts for a
maximum of 2000 MC steps. If an energetically improved structure has not been generated
within the previous 400 MC steps, the stage terminates. Over the course of the five assembly
stages, the weight of clashing penalties in the total score is ramped up as 0, 125, 250, 375 and
500.

The five protein assembly stages are followed by a stage of structural refinement. This stage 
lasts for a maximum number of 2000 MC steps and terminates if no energetically improved 
model is sampled for 400 MC steps in a row. Unlike the assembly stages, the refinement stage 
only consists of small structural perturbations, which will not drastically alter the topology of 
the protein model. 
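The stage logic described above (at most 2000 MC steps, early termination after 400 steps without improvement) can be sketched as a generic loop. The callables `propose`, `score`, and `accept` are hypothetical placeholders for the perturbation, scoring, and Metropolis steps; this is a schematic reading of the text, not the BCL::Fold code.

```python
# Clash-penalty weights of the five assembly stages, as listed in the text.
CLASH_WEIGHTS = (0, 125, 250, 375, 500)

def run_stage(model, propose, score, accept, max_steps=2000, patience=400):
    """Run one stage; returns (best_model, best_score, steps_performed)."""
    current, current_score = model, score(model)
    best, best_score = current, current_score
    since_improvement = 0
    steps = 0
    for _ in range(max_steps):
        steps += 1
        candidate = propose(current)
        candidate_score = score(candidate)
        if accept(current_score, candidate_score):
            current, current_score = candidate, candidate_score
        if current_score < best_score:
            best, best_score = current, current_score
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                break  # no improved model within `patience` steps
    return best, best_score, steps

# Toy usage: a "model" is an integer and the score is its value.
stalled = run_stage(0, lambda m: m, float, lambda old, new: new <= old)
improving = run_stage(0, lambda m: m - 1, float, lambda old, new: new <= old)
```

In a full protocol, the five assembly stages would call such a loop with a score that includes the respective clash weight, followed by one refinement stage restricted to small perturbations.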

Through multiple prediction runs with different score weights, the optimal contribution of
the cross-linking score to the total score was determined to be 40% to 50%. Consequently, the
weight for the scoring term evaluating the agreement of the model with the cross-linking data
was set to 300 over all six stages, which ensures that the cross-linking score contributes between
40% and 50% to the total score.

Figure 2: Cross-link yield in dependence of the spacer length. Behavior of valuable and possible
cross-links in the 25 kDa MW bin and localization of the optimal spacer length, plotted against
the spacer length [Å]. The number of valuable cross-links for every tested spacer length is shown
in red (rel # valuable); these values are normalized to a range spanning 1. Blue points show the
share of valuable cross-links among the physically possible ones (#val/#pos). The dotted line marks
the intersection of both curves and represents the optimal spacer length, where the best ratio
between valuable and possible cross-links is attained and the number of valuable cross-links is
maximized with respect to this ratio.


2.9 De novo folding simulations without and with cross-linking restraints 


To evaluate the influence of cross-linking restraints on protein structure prediction accuracy,
each protein was folded in the absence and in the presence of Lys-Lys, Lys-Glu,
and Lys-Asp cross-linking restraints. Independent structure prediction experiments were
performed for each of five spacer lengths: the predicted optimal spacer length as well as
two shorter and two longer cross-linker spacer lengths (Table 2). Additionally, predictions
were performed using a combination of all spacer lengths as well as using restraints obtained with
the optimal spacer length of all three cross-linker reactivities. For the two proteins for
which experimentally determined cross-linking data were available, protein structure prediction
was additionally performed with the experimentally determined restraints. For each protein
and cross-linker length used, 5000 models were sampled in independent MCM trajectories.
Due to the randomness of the employed MC algorithm, ten sets of 5000 models were
sampled for each protein without restraints. Improvements in prediction accuracy can be
compared to the standard deviations to identify statistically significant improvements (Table 3).


Figure 3: Spacer length and protein structure prediction workflow. (A) Virtual cross-link
analysis: workflow for the prediction of the optimal spacer length depending on the MW of the
protein of interest, starting from the culled PDB set of 2055 protein structures (sequence
identity <= 20%, resolution 1.6 Å, R-factor < 0.25) and searching for the spacer length that
enriches valuable cross-links. (B) Modelling benchmark: workflow for de novo protein structure
prediction using BCL::Fold with predicted optimal and non-optimal cross-link lengths. SSEs are
predicted using the SSE prediction methods PsiPred and Jufo9D (consensus approach). An MCM
algorithm subsequently searches the conformational space for the structure with the most
favorable score.


2.10 Metrics for calculating model accuracy and enrichment

The prediction results were evaluated using the protein-size-normalized root-mean-square- 
deviation (RMSD100) 37 and enrichment 12 36 metrics. The RMSD100 metric was used to quantify 
the sampling accuracy by computing the normalized root-mean-square-deviation between the 
backbone atoms of the superimposed model and native structure. The enrichment metric was 
used to quantify the discrimination power of the scoring function by computing which percentage 
of the most accurate models can be selected by the scoring function. The enrichment metric is 
used to assess the influence of the cross-linking restraints to discriminate among the sampled 
models. First, the models of a given set S are sorted by their RMSD100 relative to the native 
structure. The 10% of the models in S with the lowest RMSD100 are assigned to subset P 
(positives) and the remaining 90% of the models are assigned to subset N (negatives). Second, 
the models in S are sorted by their BCL score. The 10% of the models in S with the best 
score are assigned to subset FS (favorable score). The intersection TP = FS ∩ P contains the
most accurate models which the scoring function can select (true positives). The enrichment
$e = \frac{\#TP}{\#P} \times \frac{\#P + \#N}{\#P}$ quantifies the share of the most accurate
models the scoring function can select. In order to
reduce the influence of the sampling accuracy on the enrichment values, the positive models are
considered the 10% of the models with the lowest RMSD100, so that $\frac{\#P + \#N}{\#P}$ is fixed at a value of
10.0. Therefore, the enrichment ranges from 0.0 to 10.0, with a score of 1.0 indicating random
selection and a value above 1.0 indicating that the scoring function enriches for native-like
models.
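The enrichment calculation can be written down in a few lines. The sketch below follows the definition in this section and assumes that a lower score is more favorable; the function name is illustrative, and ties are broken by model index.

```python
def enrichment(rmsd100_values, scores, fraction=0.10):
    """Enrichment e = (#TP / #P) * ((#P + #N) / #P) for one model set.

    Assumes lower scores are more favorable.
    """
    n = len(rmsd100_values)
    if n == 0 or n != len(scores):
        raise ValueError("need equally sized, non-empty lists")
    k = max(1, int(round(fraction * n)))  # size of P and of FS
    by_rmsd = sorted(range(n), key=lambda i: rmsd100_values[i])
    by_score = sorted(range(n), key=lambda i: scores[i])
    positives = set(by_rmsd[:k])   # subset P: most accurate models
    favorable = set(by_score[:k])  # subset FS: best-scoring models
    true_positives = len(positives & favorable)
    return (true_positives / k) * (n / k)

# Toy sets of 100 models: score perfectly correlated or anti-correlated with RMSD100.
rmsds = [float(i) for i in range(100)]
perfect = enrichment(rmsds, rmsds)
worst = enrichment(rmsds, [-r for r in rmsds])
```

With 10% positives the second factor equals 10, so a scoring function that picks positives at the random rate of 10% yields an enrichment of 1.0.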


3 Results 

3.1 Creation of an in silico cross-linking database 

We performed in silico cross-linking experiments on 2055 non-redundant proteins. The proteins cover
an MW range from 1.4 kDa to 139 kDa, and 59% of them have an MW below 25 kDa. For
each of those proteins, all Lys-Lys, Lys-Asp, and Lys-Glu sequence and Euclidean distances as
well as the SAS distances between the Cβ atoms were determined. The resulting database
contained information on 391 902 Lys-Lys, 395 815 Lys-Glu, and 360 101 Lys-Asp pairs, which
formed the basis for the determination of the number of possible cross-links, cross-links useful for
structure prediction, and finally for the prediction of the optimal cross-linker length for studying
a selected protein (Figure 5).

3.2 Estimation of the possible cross-links per protein 

Next, we estimated how many and which of the distances could be cross-linked with a cross-linker
of a given length and specificity. We considered cross-links possible if the sum of the spacer length
and the length of the two connected side-chains (Cβ-Cβ, Lys-Nζ-Lys-Nζ, Lys-Nζ-Asp-Cγ, or
Lys-Nζ-Glu-Cδ) is longer than the Cβ-Cβ SAS distance between the amino acids. For the lengths
of the side-chains of Lys (Cβ-Nζ), Asp (Cβ-Oδ), and Glu (Cβ-Oε), values of 4.5 Å, 2.4 Å, and 3.6 Å were
used, which were determined as average values from the crystal structure of calmodulin (PDB
entry 1CLL). In silico cross-linking experiments were conducted for all of the 2055 proteins using
homobifunctional Lys-Lys-reactive, as well as heterobifunctional (Lys-Asp- and Lys-Glu-reactive)
cross-linking reagents with lengths from 1 Å to 60 Å (step size 1 Å).

To draw conclusions from the correlation of these in silico cross-linking experiments with the MW
of the studied proteins, the proteins were grouped into 45 bins with a step size of 2.5 kDa. For
example, a protein with an MW in the range of 25 kDa to 27.5 kDa contains on average 15.1 Lys,
14.4 Asp, and 16.7 Glu. On average, 182 Lys-Lys, 173 Lys-Glu, and 144 Lys-Asp cross-links
exist per protein within this specific MW bin. Theoretically, all of those could be cross-linked
with a cross-linker of 60 Å. In contrast, with a cross-linker of 13 Å (e.g. BS3) only
about 33% of the cross-links are formed in silico. With a cross-linker length of 1 Å
(e.g. close to EDC), only 10% of all possible amino acid pairs are linked.

3.3 Estimation of structurally relevant cross-links 

In protein structure prediction approaches, the enrichment of low-RMSD structures among
thousands of generated models is crucial. Therefore, we hypothesized that restraints that are
valuable for structure prediction will reduce the conformational search space substantially. For
the present study, we classify a cross-linking restraint as useful for structure prediction if it
discriminates against at least 50% of all possible conformations. Thus, in a second step, each of the
possible cross-links was evaluated as to whether it has the potential to discriminate against at least 50% of
incorrect structures (useful for structure determination) or whether the cross-linked amino acids
are so close in sequence that it can be derived from the sequence separation alone that the distance can
be bridged by the cross-linker independently of the protein's structure (not useful for structure
determination).

In order to develop a stringent measure for usefulness, we did not simply assume the maximum
distance that can be bridged by an amino acid chain of a certain length. Instead, the Euclidean
distance distributions for Lys-Lys, Lys-Glu, and Lys-Asp were computed for sequence
separations ranging from 1 to 60 amino acids within our database of protein structures. For
example, in the more than 2000 analyzed structures there are 3132 Lys-Lys pairs that are
separated by ten amino acids. For this sequence separation, Euclidean distance bins of 2.5 Å were
defined in which the occurrences of residue pairs were counted. The pairs were present in bins
ranging from 2.5 Å to 35.0 Å, and the median distance was 15.5 Å. For the same sequence
separation, the distributions of Lys-Glu (3336 pairs) and Lys-Asp (3010 pairs) are quite similar, and
the median values were 15.6 Å and 15.3 Å.

Similarly, for sequence separations of 15 amino acids we observed 3024 Lys-Lys pairs, 3200
Lys-Glu pairs, and 2835 Lys-Asp pairs. The median values are 20.8 Å, 20.9 Å, and 20.7 Å,
respectively. For sequence separations of 60 amino acids, we observed 2167 Lys-Lys pairs, 2212
Lys-Glu pairs, and 2167 Lys-Asp pairs. The median values are 23.0 Å, 23.0 Å, and 23.0 Å,
respectively (Figure 1).

Approximating the protein structures as spheres, we applied a logarithmic model to fit
the relationship between the sequence separation $S$ and the median Euclidean distance $E_{med}$.
We find (i) $E_{Lys-Lys} = 5.46 \times \ln(S_{Lys-Lys}) + 2.2$, (ii) $E_{Lys-Glu} = 5.37 \times \ln(S_{Lys-Glu}) + 2.36$,
and (iii) $E_{Lys-Asp} = 5.19 \times \ln(S_{Lys-Asp}) + 2.36$ for Lys-Lys, Lys-Glu, and Lys-Asp distances,
respectively.

Secondly, using the derived functions describing the S/E relationships, we considered a
cross-link to be of reasonable discriminative power if the sum of
the cross-linker spacer length and the average length of both contributing side-chains is shorter
than the median of the sequence/Euclidean-distance distribution. In the 25 kDa
MW bins of Lys-Lys targets with a 1 Å spacer, 1167 of the possible 22 398 target pairs
fulfilled this criterion and were considered to be of sufficient discriminative power (Figure 7).
These cross-links, which represent 4% of all Lys-Lys distances, were therefore defined
as useful for protein structure prediction. Application of a 13 Å spacer length results in 2935
valuable target pairs (12% of all Lys-Lys distances, see Figure 7). In contrast, a
cross-linker with a spacer length of 60 Å would allow all distances to be cross-linked. However, none
of the cross-links would have discriminative power for native-like models (Figure 7).
For the proteins of the 25 kDa MW bins, the number of valuable cross-links as a function of
the cross-linker length has a log-normal character, never exceeding a roughly 25 Å spacer. The
intermediate length of 13 Å resulted in an almost equal contribution of valuable and structurally
non-valuable cross-linking pairs. Whereas 29% of all possible reactive amino acid pairs are linked,
12% are considered valuable for structure prediction (Figure 7).


3.4 Prediction of molecular weight dependent optimal cross-linker spacer lengths 

Whereas usage of a short cross-linker will result in only a few but mostly structurally valuable
restraints, a longer cross-linker will yield more restraints but a lower ratio of valuable restraints.
Furthermore, the ratio of valuable restraints as well as the number of possible restraints depends
on the size of the protein. In agreement with prior studies regarding structural modeling driven
by sparse distance restraints, 38 we hypothesize that a compromise between maximizing the
portion of valuable cross-links compared to all cross-links which can be formed with a given
cross-linker length ($\#valuable / \#possible$) and maximizing the relative number of achievable cross-links with
any spacer length ($\#valuable / \#valuable_{max}$) might yield the best results.

Following our hypothesis, for each MW bin we derived the optimal spacer length as the
intersection point of the two functions, as shown for the 25 kDa MW bin in Figure 2.

The derived optimal spacer lengths for Lys-Lys, Lys-Asp, and Lys-Glu were plotted as a function
of the MW (Figure 4). The relationship was fitted using a cube root
function. Over the MW range sampled for Lys-Lys cross-links, the optimal spacer lengths ranged
from 5.0 Å to 18.6 Å. No optimal spacer length deviated by more than 2.5 Å
from the regression curve, and the average deviation from the modeled spacer lengths was 0.7 Å.
The MW term as well as the side-chain term was modeled with a fractional exponent, reflecting
the relation between volume and distance in spherical objects.

Additionally, the optimal spacer lengths were predicted for homobifunctional arginine
and homobifunctional cysteine cross-linking reagents, following the procedure described
for the homo- and heterobifunctional lysine-containing cross-links. Consistently, the
optimal spacer lengths depend on the MW as well as on the lengths of the cross-linked side-chains
$SS_1$ and $SS_2$ and can be calculated as
$l_{opt}[\text{Å}] = k \times \sqrt[3]{MW} + \sqrt[3]{SS_1 + SS_2}$. k was determined
to be 0.32, 0.31, 0.34, 0.34 and 0.35 for Lys-Lys, Lys-Asp, Lys-Glu, Arg-Arg, and Cys-Cys,
respectively.
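
As a worked example of this formula, the following Python sketch evaluates it for Lys-Lys cross-linkers. Two points are assumptions rather than statements from the text: the MW is taken in Da, and the summed side-chain length is set to the 9.1 Å quoted for Lys-Lys in section 4.1 (side-chain sums for the other pairs are not given here). The function name `optimal_spacer_length` is hypothetical.

```python
# k values reported in the text for each cross-linker reactivity.
K_BY_REACTIVITY = {
    "Lys-Lys": 0.32,
    "Lys-Asp": 0.31,
    "Lys-Glu": 0.34,
    "Arg-Arg": 0.34,
    "Cys-Cys": 0.35,
}


def optimal_spacer_length(mw_da, side_chain_sum, k):
    """Return k * cbrt(MW) + cbrt(SS1 + SS2) in Angstrom (MW in Da, assumed)."""
    return k * mw_da ** (1.0 / 3.0) + side_chain_sum ** (1.0 / 3.0)


# Lys-Lys example with the 9.1 Angstrom side-chain sum quoted in section 4.1.
for mw in (10_000, 25_000, 50_000, 100_000):
    length = optimal_spacer_length(mw, 9.1, K_BY_REACTIVITY["Lys-Lys"])
    print(f"{mw // 1000:>3d} kDa: {length:.2f} A")
```

With these assumptions the sketch reproduces the recommended lengths listed in section 4.1 (9.0 Å, 11.5 Å, 13.9 Å and 17.0 Å) to within about 0.1 Å; the small residual is consistent with k being reported to two decimals.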


3.5 Generation of in silico and experimental cross-linking data for testing the
effect of different spacer lengths on de novo modeling

To evaluate the effect of cross-linking data derived from experiments with different spacer
lengths, we folded seventeen proteins de novo with BCL::Fold (Figure 3). Thirteen proteins
were part of our data set, while for four proteins experimental cross-linking data were available
(1MBO, 1HRC, 1FGA, and 4HRE) (Table 1). All proteins have a MW in the range
from 13 kDa to 27 kDa. Most structures were mainly α-helical with fewer β-strand SSEs. The
β-strand content ranged from 0% to 51% and the α-helical content from 2% to 81%.
1LMI showed the highest β-strand content and the lowest α-helical content. The portion of
lysines was between 3% and 23%, which corresponds to between 4 and 46 lysine residues per
structure. For the two structures 1MBO and 1HRC, which were studied experimentally, we used
the published experimental data, which were obtained using DSG, DSS/BS3, and DEST [25]. For
1MBO, 8 cross-links were reported in total with the 11.4 Å reagent BS3, four of them confirmed
with the 7.7 Å reagent DSG. For 1HRC, 48 cross-links were reported: 9 with DSS, 31 with BS3,
6 with DSG, and 9 with DEST (11 Å). Six cross-links had been identified with more than one
cross-linking reagent. 18 BS3 cross-links were published for 1FGA [19], whereas 3 intramolecular
BS3 cross-links were available for 1HRE [31]. For the thirteen proteins as well as for 1MBO and
1HRC, we predicted all cross-links that are possible with the predicted optimal cross-linker
length as well as with two shorter and two longer cross-linking reagents (Table 2) and used
these data as restraints during modeling (Figure 6).
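
The idea of enumerating the cross-links that are possible for a given spacer length can be sketched as follows. This is a deliberately simplified illustration, not the procedure used in the study: it uses straight-line Cβ-Cβ distances, whereas the simulated cross-links described in this work were generated with Xwalk, which evaluates distances along the protein surface. The cutoff is assumed to be the spacer length plus the 9.1 Å summed Lys-Lys side-chain length quoted in section 4.1. The function name, the input layout, and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def candidate_cross_links(lysines, spacer_length, side_chain_sum=9.1, min_separation=1):
    """List Lys-Lys pairs whose C-beta atoms lie within spacer + side-chain reach.

    lysines: list of (residue_number, (x, y, z)) tuples with C-beta coordinates.
    Uses Euclidean distance only (assumption); no surface path is computed.
    """
    cutoff = spacer_length + side_chain_sum
    links = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(lysines, 2):
        if abs(res_a - res_b) < min_separation:
            continue  # skip pairs that are too close in sequence
        if math.dist(xyz_a, xyz_b) <= cutoff:
            links.append((res_a, res_b))
    return links


# Toy coordinates in Angstrom, invented purely for illustration.
toy_lysines = [(5, (0.0, 0.0, 0.0)), (20, (15.0, 0.0, 0.0)), (40, (30.0, 0.0, 0.0))]
for spacer in (2.5, 7.7, 30.0):
    print(spacer, candidate_cross_links(toy_lysines, spacer, min_separation=10))
```

The `min_separation` argument mirrors the observation in section 3.8 that cross-links with a sequence separation below ten residues carry little structural information.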


3.6 Cross-linking restraints improve the sampling accuracy of de novo protein
structure prediction

XL-MS data provide structural restraints that reduce the sampling space in de novo structure
determination. A fraction of incorrect conformations is thereby excluded, and the sampling
density in all other areas of the conformational space is increased. To evaluate the power of
cross-linking restraints to guide de novo protein structure determination, we computed the
RMSD100 values of the most accurate models for each protein for structure prediction with and
without cross-linking restraints. Using cross-linking restraints not only increases the
frequency with which accurate models are sampled, but the best models also achieve an accuracy
not sampled in the absence of cross-linking data (Table 3). Across all benchmark proteins, the
accuracy of the best models was, on average, 6.6 Å when no cross-linking data were used. By
using cross-linking data for the spacer length deemed optimal, the average RMSD100 value was
improved to 5.6 Å, which corresponds to two standard deviations. By using restraints obtained
for all five spacer lengths, the average accuracy of the best model improved to 5.2 Å. For the
proteins 1XQ0, 2IXM, and 3B50, it was not possible to sample a native-like conformation even
with cross-linking data. We attribute this to limitations in the sampling algorithm, which
result in the native conformation not being part of the sampling space. For other proteins,
significant improvements were observed. While the accuracy of the best models for 1ES9 and 1J77
was 7.3 Å and 6.8 Å without restraints, cross-linking restraints improved the accuracy to 5.7 Å
and 4.5 Å, respectively. For 1MBO, the accuracy could be improved from 7.1 Å to 4.2 Å by using
a combination of Lys-Glu/Asp reactive cross-linkers (Figure 8).

Figure 1: Relationship between sequence distance and Euclidean distance. Functions
representing the relationship between sequence distance (S) and spatial distance (E). The
equations were fitted by the method of least squares to a logarithmic function for
(A) Lys-Lys, (B) Lys-Glu, and (C) Lys-Asp.


3.7 Cross-linking restraints improve the discriminative power of the scoring 
function 

The ability of the scoring function to identify the most accurate models among the sampled
ones was quantified using the enrichment metric (see section 2). Enrichments were
computed for proteins predicted without cross-linking data, for each spacer length, and for all
spacer lengths combined. For protein structure prediction without cross-linking restraints an
average enrichment of 1.1 was measured, which is barely better than random selection; the
scoring function alone has almost no discriminative power. Using cross-linking restraints
yielded by the optimal spacer length improved the enrichment to 2.1 (Table 3), which corresponds
to three standard deviations. Using all five spacer lengths to obtain additional restraints
further improved the enrichment to 2.4. The most significant improvement was observed for 1J77,
for which the enrichment improved from 0.5 to 2.4.

Figure 5: Protein structure prediction results for various spacer lengths. Cross-linking data
improve prediction accuracy and discrimination power. Using geometrical restraints derived from
cross-linking experiments reduces the size of the conformational space that needs to be searched
for the conformation with the lowest free energy. This results in a higher likelihood of sampling
accurate models and an improved discrimination power of the scoring function. Panel A compares
the RMSD100 values of the most accurate model for structure prediction from different spacer
lengths to the results for the optimal spacer length (horizontal line). Panel B compares the
enrichments for different spacer lengths likewise.
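
The enrichment metric itself is defined in section 2 and is not restated here. As a hedged illustration only, the sketch below implements one common definition that is consistent with the numbers in this section: the share of the 10% most accurate models that is recovered among the 10% best-scoring models, divided by 10%, so that random selection gives an expected value of 1 and perfect selection gives 10. The 10% fraction, the convention that lower scores are better, and the function name `enrichment` are assumptions.

```python
def enrichment(rmsds, scores, fraction=0.1):
    """Enrichment of accurate models among the best-scoring models.

    Assumed definition: (share of the most accurate `fraction` of models found
    in the best-scoring `fraction`) divided by `fraction`. Lower RMSD and
    lower score are both treated as better (assumption).
    """
    if len(rmsds) != len(scores) or not rmsds:
        raise ValueError("rmsds and scores must be non-empty and of equal length")
    n = len(rmsds)
    k = max(1, round(n * fraction))
    most_accurate = set(sorted(range(n), key=lambda i: rmsds[i])[:k])
    best_scoring = set(sorted(range(n), key=lambda i: scores[i])[:k])
    recovered = len(most_accurate & best_scoring) / k
    return recovered / (k / n)


# Toy example: a score that ranks models exactly by accuracy, and one that inverts it.
toy_rmsds = [float(i) for i in range(100)]
print(enrichment(toy_rmsds, toy_rmsds))
print(enrichment(toy_rmsds, [-r for r in toy_rmsds]))
```

Under this reading, an enrichment of 1.1 means that 11% of the most accurate models are found among the best-scoring models, which matches the percentages given in the Conclusion.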


3.8 The cross-linker length determines improvements in sampling accuracy and 
discrimination power 

The length of the cross-linker determines the number of obtainable restraints as well as their
information content [10]. While a longer cross-linker is able to form more cross-links and
therefore yields a larger number of restraints, a restraint from a longer cross-linker can be
fulfilled by a larger number of conformations, which reduces its discriminative power. To assess
the influence of the cross-linker length, and therefore of the number of restraints and the
restraint distances, on the sampling accuracy and discrimination power, the protein structure
prediction protocol was conducted with restraints derived from different cross-linker lengths.

The cross-linker lengths were separated into five groups: optimal, which is the predicted
optimal cross-linker length; short1 and short2, which are shorter cross-linker lengths; and
long1 and long2, which are longer cross-linker lengths. The cross-linker length predicted to be
optimal yielded the most useful restraints for protein structure prediction in terms of sampling
accuracy and discriminative power. Across all proteins the average RMSD100 value of the most
accurate models for the optimal cross-linker length was 5.6 Å (an improvement of 15%), whereas
it was 6.3 Å, 6.2 Å, 5.9 Å and 6.1 Å (improvements of 5%, 6%, 11% and 8%) for the
shorter and longer cross-linker lengths, respectively (Figure 5). The longest
cross-linkers have a less significant impact on the sampling accuracy due to their ambiguity,
whereas the shortest cross-linkers failed to yield a sufficient number of distance restraints to
impact prediction accuracy. The discriminative power, quantified through the enrichment
metric, was 2.1 for the optimal cross-linker length, whereas it was 1.4, 1.5, 1.9 and 1.7 for the
shorter and longer cross-linkers, respectively (Figure 5). For the proteins
1X91 and 3M1X, the optimal cross-linker length did not yield any cross-links with a sequence
separation of at least ten residues and therefore did not provide relevant structural
information. In those cases protein structure prediction with longer cross-linker lengths
provided better results. By combining restraints obtained for all five cross-linker lengths, the
average enrichment value could be improved to 2.4.
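
The percentage improvements quoted above follow directly from the average RMSD100 values relative to the 6.6 Å obtained without restraints (section 3.6). The short Python check below makes the arithmetic explicit. Assigning the four non-optimal values to short1, short2, long1 and long2 in the order in which they are listed is an assumption about the reading of "respectively".

```python
def relative_improvement(baseline, value):
    """Improvement of `value` over `baseline` in percent (lower RMSD100 is better)."""
    return 100.0 * (baseline - value) / baseline


BASELINE_RMSD100 = 6.6  # average without cross-linking restraints, in Angstrom

# Average RMSD100 values from the text; the group labels are an assumed mapping.
mean_rmsd100 = {"optimal": 5.6, "short1": 6.3, "short2": 6.2, "long1": 5.9, "long2": 6.1}

improvements = {
    name: round(relative_improvement(BASELINE_RMSD100, value))
    for name, value in mean_rmsd100.items()
}
print(improvements)
```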


3.9 Combination of cross-linkers with different reactivities results in 
improvements larger than seen when varying the spacer lengths 

To obtain valuable restraints for de novo protein structure prediction, a maximum number
of SSE pairs needs to be cross-linked. The availability of Lys-Asp/Glu reactive cross-linkers
allows for a better sequence coverage and therefore a wider coverage of SSE pairs. Cross-links
with different spacer lengths were simulated for the proteins in the benchmark set using Xwalk.
To assess the influence of Lys-Asp/Glu reactive cross-linkers on protein structure prediction,
the same protocol was applied as for the Lys-Lys reactive cross-linkers. For the Lys-Glu reactive
cross-linkers an average prediction accuracy of 5.7 Å and an average enrichment of 2.2 were
achieved, which is comparable to the results for the Lys-Lys reactive cross-linkers.

While Lys-Asp reactive cross-linkers also achieve improvements in prediction accuracy and
enrichment when compared to protein structure prediction without restraints, the results are
slightly worse than for Lys-Lys reactive cross-linkers, with an average prediction accuracy of
6.0 Å versus 5.6 Å and an average enrichment of 1.7 versus 2.1 (Table 3). To a large part,
the difference in the overall results is caused by the proteins 1ES9, 1T3Y, and
3M1X, for which Lys-Asp reactive cross-linkers failed to yield restraints between SSE pairs and
therefore failed to reduce the conformational space significantly. Apart from these differences
in the average improvements over all proteins, the spacer length deemed optimal also provides
the best results for Lys-Asp/Glu reactive cross-linking. Combining the restraints yielded for
the optimal spacer lengths with Lys-Lys/Asp/Glu reactive cross-links improves the average
sampling accuracy to 5.1 Å and the average enrichment to 2.6. Combining the restraints
yielded by all spacer lengths and cross-linker reactivities failed to further improve prediction
results.


4 Discussion 

4.1 Prediction of the optimal cross-linker spacer length 

It has been shown frequently that chemical cross-linking data can be used to guide de novo
structure prediction and the selection of native-like models. The main advantages of using
XL-MS to generate such restraints are its sensitivity, its broad applicability to almost all
proteins, the nearly physiological experimental conditions during the chemical cross-linking
reaction, and its potential for automation. However, the small number and high uncertainty of
restraints from chemical cross-links limit their impact on de novo protein structure prediction,
in particular when compared to more data-rich methods such as NMR spectroscopy.

One major limitation is that distances are measured between the functional groups of long and
flexible amino acid side-chains. Therefore, a significant uncertainty is added to
the cross-linker length when converting XL-MS data into Cβ-Cβ restraints. A second challenge
of chemical cross-links is that only the maximum distance is restricted, while no information is
obtained on the minimum distance or the favored distance distribution. As a result, even a
“zero length” cross-linker restricts the Cβ-Cβ distance only to the sum of the lengths of the
two connected side-chains (e.g. 9.1 Å for Lys-Lys cross-links).

In most cross-linking experiments, lysine residues are targeted. Lysines are excellent
targets because of their overrepresentation on protein surfaces and the clean chemistry of amine
modification. Nevertheless, their frequency is on average only about 7%. As a consequence, the
number of cross-links that can be formed, for example, in a 25 kDa protein with a standard
homobifunctional Lys-Lys-reactive cross-linking reagent with a spacer length of 6.4 Å (length
of DST) is in the range of about 20. Only a small fraction of these restraints will substantially
limit the conformational space for the protein. This number is usually too small to restrict the
conformational space to an unambiguous single protein fold. To increase the number of restraints
it is possible to use cross-linkers with a longer spacer length or to target amino acids such as
Asp, Glu, Tyr, Ser, Thr, Arg, or Cys.

Restraints obtained with longer cross-linking reagents are less restrictive to the conformational
space. To evaluate the value of cross-links for protein structure prediction, we determined for
each sequence distance (0 to 60 amino acids) how long a cross-linker has to be to link the target
amino acids. For example, two lysines separated by eight amino acids in sequence
were found to be linkable in all 3488 cases by a homobifunctional cross-linker with a length of
>30 Å (as in BS(PEG)9). In our study, we hypothesized that it is desirable for two target
amino acids to be linkable in only 50% of all models created, meaning that 50%
of all structures could be discarded based on a single cross-link. For two lysines
separated by 10 amino acids, for example, this is the case for a cross-linker length of 14.8 Å
(distance distributions for other sequence distances are shown in Figure 1). Cross-links that
could be formed in less than 50% of cases for the corresponding sequence distance were
considered valuable. Based on this definition, the optimal spacer length was calculated for all
2055 structures of the non-redundant protein structure database. With this optimal spacer
length, the number of structurally valuable cross-links is maximized, taking into account
that, in general, for modeling approaches a few highly discriminative distance restraints
are less favorable than a larger number of restraints with lower discriminative power.
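
The 50% criterion can be sketched in Python using the logarithmic fit for Lys-Lys pairs reported in the Conclusion, E = 5.5 × ln(S) + 2.2. The sketch assumes that E is the cross-linker length at which half of all pairs with sequence separation S can be linked, and that a cross-link counts as valuable when the cross-linker is shorter than this length. Whether E includes the side-chain contribution is not stated in this excerpt, so lengths are compared on a single common scale here. Both function names are hypothetical.

```python
import math


def median_linkable_length(sequence_separation):
    """Cross-linker length (Angstrom) linking about 50% of Lys-Lys pairs at separation S.

    Uses the fit E = 5.5 * ln(S) + 2.2 quoted in the Conclusion.
    """
    if sequence_separation < 1:
        raise ValueError("sequence separation must be at least 1")
    return 5.5 * math.log(sequence_separation) + 2.2


def is_valuable(cross_linker_length, sequence_separation):
    """A cross-link is treated as valuable if fewer than 50% of pairs could form it."""
    return cross_linker_length < median_linkable_length(sequence_separation)


print(f"{median_linkable_length(10):.1f}")  # compare with the 14.8 A quoted above
print(is_valuable(11.4, 10), is_valuable(11.4, 3))
```

The fitted curve gives a value within 0.1 Å of the 14.8 Å quoted for a sequence separation of 10; the difference is consistent with rounding of the fit coefficients.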

Since the optimal cross-linker length should depend on the protein size in a cube-root
fashion, which converts volume into distance, it is not unexpected that this dependency was
also observed for the MW (Figure [I] on page 11). However, one has to keep in mind that the
formula might not be applicable to non-globular and multi-domain proteins. In the case of
multi-domain proteins, the formula should nevertheless be applicable to the separated domains.
Remarkably, based on our simulation, the recommended spacer lengths for proteins with MWs of
10 kDa, 25 kDa, 50 kDa and 100 kDa are 9.0 Å, 11.5 Å, 13.9 Å and 17.0 Å, respectively. These
values are quite close to the commercially available homobifunctional amine-reactive
cross-linkers DSG (7.7 Å), BS3 (11.4 Å), and EGS (16.1 Å), which are currently the preferred
choice to study small (<20 kDa), medium (20 kDa to 50 kDa), and large proteins (>50 kDa),
respectively.
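
To connect the recommended lengths to the reagents named above, the following Python sketch picks the reagent whose spacer length lies closest to the predicted optimum. It is an illustration under stated assumptions: the cube-root formula with k = 0.32 for Lys-Lys, MW in Da, a summed side-chain length of 9.1 Å, and a reagent table limited to the three spacer lengths quoted in this paragraph. The function names are hypothetical, and a nearest-length rule is only one possible way to choose a reagent.

```python
# Spacer lengths in Angstrom as quoted in the text.
REAGENT_SPACER_LENGTHS = {"DSG": 7.7, "BS3": 11.4, "EGS": 16.1}


def recommended_length(mw_da, k=0.32, side_chain_sum=9.1):
    """Predicted optimal Lys-Lys spacer length (assumes MW in Da)."""
    return k * mw_da ** (1.0 / 3.0) + side_chain_sum ** (1.0 / 3.0)


def closest_reagent(mw_da):
    """Name of the reagent whose spacer length is nearest to the predicted optimum."""
    target = recommended_length(mw_da)
    return min(
        REAGENT_SPACER_LENGTHS,
        key=lambda name: abs(REAGENT_SPACER_LENGTHS[name] - target),
    )


for mw in (10_000, 25_000, 100_000):
    print(f"{mw // 1000} kDa -> {closest_reagent(mw)}")
```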

Addressing different functional groups is a second approach to increase the total number of
distance restraints. The drawback is that either the cross-linking reaction is less efficient
or less specific (Asp, Glu, Tyr, Ser, Thr), which creates challenges in data interpretation, or
the target amino acids are less frequent (Arg and Cys), which limits the number of restraints
observed. However, the same theoretical approach revealed that the optimal spacer length for
heterobifunctional Lys-Asp and Lys-Glu cross-linkers (Figure 8) as well as for homobifunctional
Cys-Cys and Arg-Arg cross-linkers can be predicted with the same equation:
$l_{opt}[\text{Å}] = k \times \sqrt[3]{MW} + \sqrt[3]{SS_1 + SS_2}$, with k between 0.31 and
0.35 depending on the reactivity, in which $SS_1$ and $SS_2$ are the average lengths of the
cross-linked side-chains.

Comparing the two approaches to increase the number of valuable cross-links, it should
be pointed out that using several cross-linking reagents with different reactivities results in
a significantly higher improvement of the model quality than using only lysine-reactive
cross-linking reagents with different spacer lengths.


4.2 Challenges in using cross-linking data to guide de novo modeling 

To test whether the cross-linker with the predicted optimal spacer length indeed performs best
in modeling, we chose the de novo structure prediction approach BCL::Fold for testing. Even
though comparative modeling using known protein structures as a template usually performs
better than de novo modeling, our goal was to maximize the impact of XL-MS restraints.

A major limiting factor for de novo protein structure prediction methods is the vast size of 
the conformational space. Cross-linking restraints can aid the computational prediction of a 
protein’s tertiary structure by drastically reducing the size of the sampling space. Cross-linking 
experiments yield a maximum Euclidean distance between the cross-linked residues, which 
increases the sampling density in the relevant part of the conformational space. 

A major limitation of using cross-linking restraints to guide protein structure prediction when
compared to restraints obtained from EPR and NMR spectroscopy is that the cross-linker length
cannot be directly translated into a Euclidean distance between the cross-linked residues. While
cross-link prediction and evaluation methods like Xwalk are successful at predicting whether a
certain cross-link can be formed in a given structure, explicit modeling approaches are
computationally too expensive for use in a rapid scoring function required for protein structure
prediction. Approximations, such as the great circle on a sphere presented here, are fast to
compute but associated with increased uncertainties. Most of the cross-linkers used can cover a
long Euclidean distance, and therefore the resulting restraints can be fulfilled by a wide
variety of conformations. In spite of this, cross-linking restraints display some potential in
limiting the size of the sampling space, resulting in a higher density of accurate models.
Further, the geometrical restraints derived from XL-MS allow for the discrimination of a
significant fraction of models representing incorrect topologies and therefore improve the
discriminative power of the scoring function.
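
The great-circle approximation mentioned above, described in the caption of Figure 6, can be sketched as follows. Only two facts are taken from the text: the path of the cross-linker is approximated by an arc connecting the two residues, and agreement is scored between -1 (best) and 0 (worst) from the difference between arc length and cross-linker length. Everything else is an assumption made for illustration, in particular that both residues lie on a sphere of known radius, the linear shape of the transition, and its 5 Å width. This is not the BCL::Fold implementation, and the function names are hypothetical.

```python
import math


def arc_length(chord, radius):
    """Great-circle arc between two points on a sphere that are `chord` apart.

    Assumes both residues sit on the surface of a sphere of the given radius.
    """
    if chord > 2.0 * radius:
        raise ValueError("chord cannot exceed the sphere diameter")
    return 2.0 * radius * math.asin(chord / (2.0 * radius))


def restraint_score(d_arc, d_xl, transition=5.0):
    """Score in [-1, 0]: -1 if the arc fits within the cross-linker length d_xl.

    The linear ramp over `transition` Angstrom is an assumed shape.
    """
    excess = d_arc - d_xl
    if excess <= 0.0:
        return -1.0
    if excess >= transition:
        return 0.0
    return -1.0 + excess / transition


# Two residues 16 A apart on a sphere of radius 15 A, scored against a 20 A reach.
d_arc = arc_length(16.0, 15.0)
print(round(d_arc, 2), restraint_score(d_arc, 20.0))
```

The arc is always at least as long as the straight-line distance, which reflects that the cross-linker has to travel around the protein rather than through it.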

4.3 Abilities and limitations of protein structure prediction from limited 
experimental data 

We showed that incorporation of cross-linking data into a de novo protein structure prediction
method improves the accuracy of the structure prediction. The two major challenges of de novo
predictions are the sampling of structures and the discrimination of inaccurate structures.
In this study, reduction of the conformational space was achieved through the assembly of
predicted SSEs with limited flexibility and the incorporation of geometrical restraints derived
from cross-linking data. The discrimination of inaccurate models is performed through a scoring
function that approximates the free energy. Assuming that the native structure is in the
global energy minimum, complete sampling and an accurate method to measure the free energy
would lead to the correct identification of the native conformation. However, the conformational
space is too large to be extensively sampled and the free energy needs to be approximated,
which results in ambiguity regarding the model that is most similar to the native structure.
Incorporating cross-linking data provides geometrical restraints, which can be used as additional
criteria to discriminate inaccurate models. While an average sampling accuracy of 5.1 Å when
using restraints yielded by XL-MS is a significant improvement over the 6.6 Å obtained without
cross-linking data, it was possible for only four proteins to sample models with an RMSD100
of less than 4 Å when compared to the crystal structure. Cross-linking data yield an upper
boundary for the Euclidean distance between the cross-linked residues, which allows for the
placement of the second residue within a sphere of volume $\frac{4}{3}\pi r^3$ around the first
residue. Depending on the cross-link distribution, topologically different models can fulfill
the same restraint set. Discrimination among those models is impossible with XL-MS restraints.
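
The cubic dependence of the permitted volume on the distance bound explains why longer cross-linkers lose discriminative power quickly. The Python sketch below compares the sphere volumes for two upper bounds. Using the spacer lengths of DSG (7.7 Å) and BS3 (11.4 Å) directly as radii is a simplification made for illustration; the actual bound between residues also includes the side-chain contribution.

```python
import math


def sphere_volume(radius):
    """Volume of a sphere, (4/3) * pi * r^3."""
    return 4.0 / 3.0 * math.pi * radius ** 3


# Illustrative radii: spacer lengths of DSG and BS3, side chains ignored (assumption).
ratio = sphere_volume(11.4) / sphere_volume(7.7)
print(f"volume ratio = {ratio:.2f}")
```

A spacer that is about 1.5 times longer thus permits roughly three times the volume for the second residue.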

4.4 Comparison of experimental and in silico cross-links 

In order to draw general conclusions based on the analysis of hundreds of different structures,
this study relies mainly on virtual cross-linking experiments. Although extensive
XL-MS data sets have been published for several proteins, it proved difficult to obtain suitable
experimental data sets for the present benchmark because of additional requirements: (i) the
protein must be monomeric and small enough for de novo protein folding with BCL::Fold, (ii) an
experimental atomic-detail structure must be available for comparison, and (iii) a large data
set of intramolecular cross-links must be available. Results for the four cases that came
closest, Pll, FGF2, cytochrome c, and oxymyoglobin, are reported to demonstrate our efforts to
work not only with simulated data. However, for Pll and FGF2, using experimentally determined
restraints did not improve the prediction results in a statistically significant way. For Pll,
only three restraints were available, with a maximum sequence separation of nine residues.
Because of the small sequence separation, these restraints contain very limited structural
information and no improvement in de novo folding can be expected. The tertiary structure of
FGF2 contains twelve β-strands, several of which are strongly bent. This protein is too large
for de novo structure determination with BCL::Fold. As BCL::Fold is unable to sample the
conformation of the protein in the first place, no significant improvement was expected or
observed when XL-MS data were added. Nevertheless, the value of the predicted cross-links in
comparison to experimental cross-links could be validated with the two proteins cytochrome c and
oxymyoglobin, for which experimental cross-links had been published in the XL database [16]. For
cytochrome c (PDB entry 1HRC), we indeed found that the cross-linker with the predicted optimal
spacer length of 10.2 Å performed best. However, for oxymyoglobin (PDB entry 1MBO) the longer
spacers improved the accuracy slightly more than the cross-linker with the optimal spacer length.
Interestingly, for both proteins several cross-links that should be possible could not be
detected, which might be due to experimental or analytical reasons. Conversely, several
cross-links were identified experimentally that had not been predicted.
An examination of these data revealed that most of these cross-links are absent from the
virtual data set because their Cβ-Cβ distances exceed the expected maximum length. This
finding is in agreement with Merkley et al., who evaluated protein structures by molecular
dynamics and reported that a high number of experimentally confirmed cross-links usually exceed
the theoretical maximal spatial distance due to structural flexibility. For Lys-Lys distances
probed with a BS3/DSS cross-linking reagent, an upper bound of 26 Å to 30 Å between Cα atoms
was concluded [39].

On the other hand, spacers usually adopt lengths that are distributed between their minimal
and maximal lengths. In line with this, it was also reported that many spacers in
commercially available cross-linking reagents preferentially adopt conformations that are
significantly shorter than the cited maximal spacer length [40]. Thus, ideally cross-linking
results should be evaluated based on experimentally derived or simulated ensembles of
in-solution structures instead of using X-ray structures as reference. However, addressing all
degrees of flexibility during de novo structure prediction is currently too resource intensive.
Furthermore, there are many additional practical challenges that may prevent the formation or
identification of cross-links and may thus result in more meaningful results with a cross-linker
of non-optimal length. Nevertheless, for both structures the sampling accuracies could also be
improved by 0.7 Å based on the experimentally determined restraints, which is only slightly
worse than the improvement of 1.0 Å observed based on in silico cross-links.


5 Conclusion 

Recent development of high-resolution MS instruments enables the analysis of proteins not
accessible to NMR spectroscopy and X-ray crystallography. Data obtained from those experiments
can be translated into structural restraints to guide protein structure prediction. The
information content of a geometrical restraint obtained from XL-MS experiments depends directly
on the spacer length used. Thus, the choice of the spacer length is an important step.

Firstly, for amino acid pairs close in sequence, only minimal structural information is
obtained if the spacer is too long. For lysines with a sequence separation of S, we estimated
the spacer length that still yields structural information as
$E = 5.5 \times \ln(S) + 2.2$. Secondly, we demonstrate that for de novo protein structure
prediction the optimal spacer length depends on the MW of the protein of interest and the
lengths of the cross-linked side-chains ($SS_1$ and $SS_2$) and can be predicted as
$l_{opt} = k \times \sqrt[3]{MW} + \sqrt[3]{SS_1 + SS_2}$, with k between 0.31 and 0.35
depending on the cross-linker reactivity.

We also demonstrate that restraints obtained from cross-linking experiments contribute
moderately to solving the two major challenges of de novo protein structure prediction: the vast
size of the conformational space and the discrimination of inaccurate models. Using restraints
from cross-linking experiments significantly increases the sampling density of native-like
models and contributes to the discrimination of incorrect models. By combining cross-linking
restraints with knowledge-based scoring functions, the accuracy of the sampled models could be
improved by up to 2.2 Å and the average enrichment of accurate models could be improved from
11% to 24%.

In conclusion, we believe this study can help in the planning of XL-MS experiments as well as
in evaluating how much information can be gained by XL-MS experiments and how much ambiguity
remains.


7 Supplementary data

Figure 6: Implicit translation from cross-linking data into structural restraints. Explicit
simulation of the cross-linker conformation is computationally expensive and prohibitive for use
in a rapid scoring function required for protein structure prediction. Instead, the cross-linker
conformation and the path crossed by the cross-linker were approximated through computing the arc
length connecting the two cross-linked residues (A). The agreement of a model with cross-linking
data was evaluated by computing the difference between the arc length ($d_{arc}$) and the
cross-linker length ($d_{xl}$). The agreement of the model with the cross-linking data is
quantified with a score between -1 and 0, with -1 being the best agreement and 0 being the worst
agreement (B).

Protein   optimal           short1            short2            long1             long2
          length   #rest    length   #rest    length   #rest    length   #rest    length   #rest
1HRC      10.2 Å   13       2.5 Å    0        7.5 Å    7        17.5 Å   27       30 Å     107
3IV4      10.4 Å   5        2.5 Å    2        7.5 Å    2        17.5 Å   7        30 Å     13
1BGF      10.7 Å   6        2.5 Å    3        7.5 Å    4        17.5 Å   10       30 Å     13
1T3Y      10.9 Å   35       2.5 Å    9        7.5 Å    20       17.5 Å   42       30 Å     63
3M1X      10.9 Å   1        2.5 Å    0        7.5 Å    0        17.5 Å   5        30 Å     19
1X91      11.0 Å   2        2.5 Å    0        7.5 Å    1        17.5 Å   8        30 Å     27
1JL1      11.2 Å   7        2.5 Å    0        7.5 Å    3        17.5 Å   11       30 Å     24
1MBO      11.3 Å   9        2.5 Å    0        7.5 Å    3        17.5 Å   23       30 Å     77
2QNL      11.5 Å   6        2.5 Å    4        7.5 Å    4        17.5 Å   8        30 Å     15
2AP3      12.1 Å   53       2.5 Å    0        7.5 Å    19       17.5 Å   136      30 Å     427
1J77      12.2 Å   29       2.5 Å    7        7.5 Å    16       17.5 Å   36       30 Å     70
1ES9      12.5 Å   8        7.5 Å    0        17.5 Å   1        37.5 Å   17       45 Å     20
3B50      12.7 Å   15       7.5 Å    2        17.5 Å   8        37.5 Å   21       45 Å     25
1XQ0      13.3 Å   9        7.5 Å    0        17.5 Å   4        37.5 Å   14       45 Å     44
2IXM      13.5 Å   41       7.5 Å    20       17.5 Å   41       37.5 Å   49       45 Å     57


Table 2: Lys-Lys cross-links yielded by different spacer lengths. Cross-links obtained for the
benchmark proteins. Simulated and experimentally determined cross-links were obtained for the
fifteen benchmark proteins. For each protein, an optimal spacer length was determined (optimal).
Additional cross-links were simulated for two shorter (short1 and short2) and two longer (long1
and long2) spacer lengths. The number of yielded cross-links (#rest) is shown for each spacer
length. For the two proteins 1HRC and 1MBO, experimentally determined cross-links were published.


Protein   Without restraints                Optimal Lys/Lys   All Lys/Lys      All reactivities
          best     σ_best   e     σ_e       best     e        best     e       best     e
1HRC      4.5 Å    0.3 Å    0.8   0.1       3.8 Å    2.0      3.8 Å    2.0     3.7 Å    5.9
3IV4      6.7 Å    0.2 Å    1.2   0.3       5.7 Å    2.5      5.3 Å    2.5     5.2 Å    1.9
1BGF      6.6 Å    0.4 Å    1.0   0.2       5.7 Å    2.1      4.9 Å    2.4     6.2 Å    1.6
1T3Y      7.0 Å    0.7 Å    1.7   0.4       6.4 Å    2.9      5.7 Å    3.0     6.2 Å    2.3
3M1X      3.8 Å    0.1 Å    0.7   0.2       3.8 Å    0.7      3.6 Å    1.5     3.6 Å    1.7
1X91      4.8 Å    0.2 Å    2.0   0.5       4.8 Å    2.0      2.0 Å    3.2     2.1 Å    3.5
1JL1      6.4 Å    0.4 Å    1.2   0.1       5.6 Å    2.1      5.3 Å    2.8     5.1 Å    2.7
1MBO      7.1 Å    0.6 Å    0.8   0.3       6.4 Å    2.0      6.5 Å    1.6     4.2 Å    2.5
2QNL      7.0 Å    0.6 Å    1.0   0.3       4.8 Å    1.9      4.1 Å    2.1     6.1 Å    2.1
2AP3      2.5 Å    0.1 Å    1.6   0.5       2.0 Å    3.0      1.6 Å    3.1     2.2 Å    2.0
1J77      6.8 Å    0.3 Å    0.5   0.2       5.0 Å    2.0      4.0 Å    2.4     3.8 Å    3.2
1ES9      7.3 Å    0.8 Å    1.1   0.6       5.7 Å    2.1      5.6 Å    2.8     6.3 Å    2.9
3B50      9.2 Å    0.9 Å    1.4   0.2       8.6 Å    1.9      9.0 Å    2.6     7.1 Å    1.9
1XQ0      9.9 Å    1.0 Å    1.1   0.3       8.3 Å    1.9      8.5 Å    2.4     7.4 Å    2.1
2IXM      9.4 Å    0.9 Å    1.1   0.4       7.9 Å    1.7      8.5 Å    1.7     7.0 Å    1.9
∅         6.6 Å    0.5 Å    1.1   0.3       5.6 Å    2.1      5.2 Å    2.4     5.1 Å    2.6



Table 3: Protein structure prediction results for different cross-linker reactivities.
Comparison between structure prediction results with and without cross-linking restraints. By
using geometrical restraints obtained from cross-linking experiments, the size of the sampling
space can be reduced, resulting in an improved sampling accuracy. This is shown by significant
improvements in the RMSD100 value of the most accurate model (best). Furthermore, cross-linking
restraints provide geometrical information, which improves the discrimination power of the
scoring function, leading to an improvement in the enrichment (e). Without restraints, ten
independent prediction trajectories were conducted and the standard deviation of the accuracy of
the best model (σ_best) and of the enrichment (σ_e) are reported. The last row (∅) lists the
averages over all proteins.

Figure 7: Lys-Lys pair distributions. Distribution of all possible and valuable Lys-Lys pairs
for a 25 kDa to 27.5 kDa weight bin. Gray bars show all theoretical pairs in their specific
distance cluster of ± 2.5 Å. Red bars show pairs that could be connected with respect to their
surface distance by a specific cross-linker (here 1 Å, 13 Å and 60 Å), always including the
side-chain contribution to the overall length. Green bars show pairs that are considered valuable
by our proposed scoring function. Pie charts show the accumulated number of cross-links for every
spacer length.

Figure 8: Selected prediction results from cross-linking data. Most accurate models sampled with
and without using cross-linking restraints. The RMSD100 values of the most accurate models
sampled for 1X91, 1J77, and 1MBO were 4.8 Å, 6.8 Å and 7.1 Å. By using restraints yielded by
Lys-Lys/Asp/Glu reactive cross-linkers, the accuracy could be improved to 2.7 Å, 5.0 Å and 4.2 Å.
Shown are the native structures of 1X91, 1J77, and 1MBO (A, D, G), the most accurate models
sampled without cross-linking restraints (B, E, H), and the most accurate models sampled with
cross-linking restraints (C, F, I). Selected restraints are shown that are not fulfilled in the
model predicted without cross-linking data (red bars), but that are fulfilled in the model
predicted with cross-linking data (black bars).